

************************************************************
************************************************************
***          Marriage and Happiness                      ***
***     2: Preparing the data for analysis               ***
***             Josef Brüderl, Volker Ludwig             ***
***                      March 2011                      ***
************************************************************
************************************************************

* Data: SOEP 1984-2009 v26

* We start with file "Happiness1.dta"
* This file was retrieved from the SOEP with "Happiness 1 Retrieval.do"

*************************************
** Preliminaries    *****************
*************************************
clear 
set more off

* Load data
use "K:\Vorlesung PDA\Stata Beispiele\Happiness1.dta", clear


*****************************************************************
***       Recode variables                                   ****
*****************************************************************

*** Define missings
* In the SOEP, there are three missing codes:
* -1 no answer / don't know: item nonresponse
* -2 does not apply (be careful with -2, these are mostly due to loops in the questionnaire)
* -3 a given value was found to be implausible and was deleted
mvdecode _all, mv(-3 -1=. \ -2=.a)   //code missing values; -2 (does not apply) gets the extended missing code .a
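
* Optional check (a sketch, not part of the original workflow): after mvdecode, none of
* the SOEP missing codes should remain in the central variables. -count- reports the
* number of offending person-years; 0 is expected.
count if inlist(happy, -1, -2, -3) | inlist(marstat, -1, -2, -3)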

* This is a useful auxiliary variable
*  !!! This variable must be recomputed whenever person-years are dropped !!!
bysort id (year): gen pynr = _n   //person-years numbered consecutively (within person)

* Tabulating the central variables (and getting N*T and N)
tab happy, missing
tab happy   if pynr==1, missing
tab marstat, missing
tab marstat if pynr==1, missing
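
* A sketch for getting N*T and N directly, instead of reading them off the table totals:
count                     //N*T: number of person-years
count if pynr==1          //N: number of persons (one first person-year per person)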

*** Household income
tab hhinc if hhinc<10
recode hhinc 0=.          //this value is implausible
gen loghhinc = ln(hhinc)

*** Dummy for woman
gen woman = sex-1         //assumes sex is coded 1=man, 2=woman


*****************************************************************
***       Creating the central time-varying variable         ****
*****************************************************************

*** Generate dummy for marriage (0=never-married 1=married) 
gen marry = marstat
recode marry 2=0 1=1 *=.  //set single to 0; married to 1; divorced, separated, widowed to missing 

*** Compute marriage duration
bysort id: egen marriageyear = min(year) if marry==1  // first year in which the person is observed married
gen yrsmarried = year - marriageyear if marry==1      // duration is time since first observed married
replace yrsmarried = 0 if marry==0                    // duration is 0 if person is never-married
gen yrsmarried2 = yrsmarried*yrsmarried               // generate squared term

* Note: Here we constructed simplified versions of the marital status and marriage duration
* variables, which may not be sufficient for a "real" study. The problem is that people might
* change their status between waves, e.g. from married to divorced to remarried. This is especially
* problematic if the panel has many and long gaps. To avoid this kind of measurement error, one
* would have to use the spell files (monthly spells stored in biomarsm).
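
* A sketch (not in the original script) for gauging how common such gaps are:
* flag each person-year that follows a gap of more than one year within a person.
* The variable name gap is introduced here for illustration only and is dropped again.
bysort id (year): gen gap = (year - year[_n-1] > 1) if _n>1
tab gap, missing          //missing for each person's first person-year
drop gap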


*****************************************************************
***       Define the estimation sample                       ****
*****************************************************************

* Given the goal of our analysis, we restrict the estimation sample to
* persons who were single when first observed.
* Persons who were married when first observed are discarded altogether!
* This differs markedly from what one is used to in cross-sectional data analysis.
* However, under the within logic it makes sense: only persons who change from
* single to married provide within information!
* The selection criteria in detail:
*** Marital status:          only persons that were unmarried when first observed are kept
*** Marital status:          all person-years after first marital dissolution are excluded 
*** Marital status:          person-years single again after marriage are excluded
*** Missings:                person-years with any missing information are excluded
*** Number of person-years:  persons with only one valid person-year are excluded

* Exclude person-years with missings on one or more variables 
gen help=0
replace help=1 if missing(happy,marstat,age,hhinc)
keep if help==0
drop help pynr
bysort id (year): gen pynr = _n            //calculate anew after each case exclusion

* All person-years after first separation, divorce or widowhood (dissolution) are excluded 
gen help=0
replace help=1 if marstat>2               //flag person-years of divorced, separated, widowed
bysort id (year): replace help=sum(help)  //for each person, flag all following person-years    
tab help
keep if help==0                           //all person years after first dissolution are dropped
drop help pynr
bysort id (year): gen pynr = _n           //calculate anew after each case exclusion

* Only persons that were unmarried when first observed are kept
gen help=0
replace help=1 if marry==1 & pynr==1       // =1 for the first observation of those married
bysort id (year): replace help = sum(help) // =1 for all observations of those married
tab help
keep if help==0                            //all observations of those married are dropped
drop help pynr
bysort id (year): gen pynr = _n            //calculate anew after each case exclusion

* A check
tab marry, missing
tab marry if pynr==1,missing


* According to the laws of logic we would be done now.
* However, (panel) data do not follow the laws of logic perfectly; they are full of errors.
* Therefore, we should check whether we really have only two sequences in the data:
*    (a) always single, (b) single some time - married some time
* Or could it be that somebody becomes single after marriage?
gen help=0
replace help=1 if marry==1                 //flag person-years in marriage
bysort id (year): replace help=sum(help)   //for each person, flag following person-years
gen help1 = (help>0 & marry==0)            //help1==1 if a person becomes single again
tab help1                                  //obviously such errors are in the data
bysort id (year): replace help1=sum(help1) //we flag all following person-years
tab help1
keep if help1==0                           //and drop them
drop help help1 pynr
bysort id (year): gen pynr = _n            //calculate anew after each case exclusion

* A check
tab marry, missing
tab marry if pynr==1,missing
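
* The same check can be made strict with -assert-, which stops the do-file if it fails.
* A sketch, assuming marstat takes only the codes 1 to 5: every person starts single,
* and marry never falls back from 1 to 0 within a person.
assert marry==0 if pynr==1
bysort id (year): assert marry >= marry[_n-1] if _n>1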


* Finally, exclude persons with fewer than two observations
bysort id:   gen pycount = _N     //# of person-years (within person)
tab pycount if pynr==1
keep if pycount > 1               //only those with 2 or more person-years are kept


*****************************************************************
***       Save the data set                                  ****
*****************************************************************

keep id year happy marry yrsmarried yrsmarried2 age loghhinc woman
order id year marry yrsmarried yrsmarried2 age loghhinc woman
sort id year
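* A final safeguard (a sketch, not part of the original script): -isid- exits with an
* error unless id and year jointly identify each observation uniquely, as a
* person-year file requires.
isid id year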
compress
* save "K:\Vorlesung PDA\Stata Beispiele\Happiness2.dta", replace

